Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize comparison operators #269

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

greg7mdp
Copy link

No description provided.

@greg7mdp
Copy link
Author

@chfast, let me know what you find out please :-)

@chfast chfast self-assigned this May 5, 2022
@chfast
Copy link
Owner

chfast commented May 5, 2022

Preliminary benchmarks:

ecmod<addmod_public>_mean                                            +0.0584         +0.0584             7             7             7             7                                                                              
ecmod<addmod_simple>_mean                                            -0.0017         -0.0017            22            22            22            22                                                                              
ecmod<addmod_prenormalize>_mean                                      -0.0999         -0.0999             5             5             5             5                                                                              
ecmod<addmod_daosvik_v1>_mean                                        +0.0210         +0.0210            12            12            12            12                                                                              
ecmod<addmod_daosvik_v2>_mean                                        +0.0103         +0.0103             6             6             6             6                                                                              
ecmod<mulmod>_mean                                                   +0.0051         +0.0051            68            68            68            68                                                                              
shift<uint256, uint256, shl_public>/-1_mean                          -0.0789         -0.0789             3             2             3             2
shift<uint256, uint256, shl_public>/0_mean                           -0.0716         -0.0716             3             3             3             3
shift<uint256, uint256, shl_public>/1_mean                           -0.0020         -0.0020             3             3             3             3
shift<uint256, uint256, shl_public>/2_mean                           -0.0885         -0.0885             3             2             3             2
shift<uint256, uint256, shl_public>/3_mean                           -0.1080         -0.1080             2             2             2             2
shift<uint256, uint256, shl_halves>/-1_mean                          +0.0045         +0.0045             2             2             2             2
shift<uint256, uint256, shl_halves>/0_mean                           +0.0012         +0.0012             3             3             3             3
shift<uint256, uint256, shl_halves>/1_mean                           +0.0016         +0.0016             3             3             3             3
shift<uint256, uint256, shl_halves>/2_mean                           +0.0004         +0.0004             2             2             2             2
shift<uint256, uint256, shl_halves>/3_mean                           -0.0006         -0.0006             2             2             2             2
shift<uint256, uint64_t, shl_public>/-1_mean                         -0.0052         -0.0052             2             2             2             2
shift<uint256, uint64_t, shl_public>/0_mean                          +0.0021         +0.0021             3             3             3             3
shift<uint256, uint64_t, shl_public>/1_mean                          +0.0002         +0.0002             2             2             2             2
shift<uint256, uint64_t, shl_public>/2_mean                          -0.0959         -0.0959             2             2             2             2
shift<uint256, uint64_t, shl_public>/3_mean                          +0.0070         +0.0070             2             2             2             2
shift<uint256, uint64_t, shl_halves>/-1_mean                         -0.0030         -0.0030             2             2             2             2
shift<uint256, uint64_t, shl_halves>/0_mean                          -0.0022         -0.0022             3             3             3             3
shift<uint256, uint64_t, shl_halves>/1_mean                          -0.0017         -0.0017             2             2             2             2
shift<uint256, uint64_t, shl_halves>/2_mean                          -0.0785         -0.0785             2             2             2             2
shift<uint256, uint64_t, shl_halves>/3_mean                          +0.0006         +0.0006             2             2             2             2
shift<uint256, uint64_t, shl_llvm>/-1_mean                           -0.0057         -0.0057             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/0_mean                            -0.0049         -0.0049             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/1_mean                            -0.0056         -0.0056             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/2_mean                            -0.0073         -0.0073             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/3_mean                            -0.0053         -0.0053             6             6             6             6
shift<uint512, uint512, shl_public>/-1_mean                          +0.0005         +0.0005             9             9             9             9
shift<uint512, uint512, shl_public>/0_mean                           +0.0008         +0.0008            10            10            10            10
shift<uint512, uint512, shl_public>/1_mean                           -0.0046         -0.0046             9             9             9             9
shift<uint512, uint512, shl_public>/2_mean                           -0.0028         -0.0028             9             9             9             9
shift<uint512, uint512, shl_public>/3_mean                           +0.0006         +0.0006             8             8             8             8
shift<uint512, uint64_t, shl_public>/-1_mean                         -0.0081         -0.0081             8             8             8             8
shift<uint512, uint64_t, shl_public>/0_mean                          +0.0538         +0.0538             8             9             8             9
shift<uint512, uint64_t, shl_public>/1_mean                          -0.0263         -0.0263             8             8             8             8
shift<uint512, uint64_t, shl_public>/2_mean                          -0.0286         -0.0286             8             8             8             8
shift<uint512, uint64_t, shl_public>/3_mean                          -0.0297         -0.0297             8             8             8             8
compare<lt_public>/64_mean                                           -0.0021         -0.0021             2             2             2             2
compare<lt_public>/128_mean                                          -0.1423         -0.1423             2             2             2             2
compare<lt_public>/192_mean                                          +0.0196         +0.0196             2             2             2             2
compare<lt_public>/256_mean                                          -0.1432         -0.1432             2             1             2             1
compare<lt_sub>/64_mean                                              +0.0092         +0.0092             2             2             2             2
compare<lt_sub>/128_mean                                             +0.0044         +0.0044             2             2             2             2
compare<lt_sub>/192_mean                                             +0.0062         +0.0062             2             2             2             2
compare<lt_sub>/256_mean                                             +0.0078         +0.0078             2             2             2             2
compare<lt_wordcmp>/64_mean                                          +0.0009         +0.0009             2             2             2             2
compare<lt_wordcmp>/128_mean                                         +0.0024         +0.0024             2             2             2             2
compare<lt_wordcmp>/192_mean                                         -0.0349         -0.0349             4             4             4             4
compare<lt_wordcmp>/256_mean                                         +0.0308         +0.0308             3             4             3             4
compare<lt_halves>/64_mean                                           -0.0025         -0.0025             3             3             3             3
compare<lt_halves>/128_mean                                          -0.0021         -0.0021             3             3             3             3
compare<lt_halves>/192_mean                                          -0.0018         -0.0018             3             3             3             3
compare<lt_halves>/256_mean                                          -0.0031         -0.0031             3             3             3             3
compare<lt_llvm>/64_mean                                             -0.0127         -0.0127             2             2             2             2
compare<lt_llvm>/128_mean                                            -0.0117         -0.0117             2             2             2             2
compare<lt_llvm>/192_mean                                            -0.0109         -0.0109             2             2             2             2
compare<lt_llvm>/256_mean                                            -0.0115         -0.0115             2             2             2             2

So this looks solid for compare operators directly and seems to help the shifts as well, although this may be code layout effect.

@chfast chfast changed the title separate change to operator<() in a separate PR for easier testing Optimize comparison operators May 5, 2022
@greg7mdp
Copy link
Author

greg7mdp commented May 5, 2022

Thanks, glad to hear :-)

@chfast
Copy link
Owner

chfast commented May 5, 2022

Confirmed, this does not affect assembly of shift operators. Proof: https://godbolt.org/z/4b6cT4rje.

@greg7mdp
Copy link
Author

greg7mdp commented May 5, 2022

Yes, makes sense. When I played with this repo a while back, I noticed that the benchmarks had variability from one run to the next, so I'm not sure how reliable the comparison truly is.
I do believe that my other changes in add, sub were also beneficial.

@codecov-commenter
Copy link

codecov-commenter commented May 5, 2022

Codecov Report

Merging #269 (f977136) into master (e243bfb) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #269   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           10        10           
  Lines         1941      1937    -4     
=========================================
- Hits          1941      1937    -4     
Flag Coverage Δ
32bit 100.00% <100.00%> (ø)
gcc 99.47% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
include/intx/intx.hpp 100.00% <100.00%> (ø)

@axic
Copy link
Collaborator

axic commented Jun 1, 2022

@chfast should this be merged?

@chfast
Copy link
Owner

chfast commented Jun 1, 2022

On the latest compilers I don't see much difference so I'm not very much in favor because this implementation has 4 branches while the original one is branchless. But people are welcome to drop their own benchmarking results (use "compare" filter).

@greg7mdp
Copy link
Author

greg7mdp commented Jun 1, 2022

Why does it matter that it is branchless if it is slower?

@chfast
Copy link
Owner

chfast commented Aug 31, 2022

I will try to clean up the situation.
We are discussing 3 "less than" implementations:

  • "sub" - subtract two values and return the borrow flag,
  • "ne" - find the top pair of not-equal words, return comparison of these; proposed here by @greg7mdp,
  • "split" - the current one which is the mixture of the two above: split into two 128-bit values and subtract them.

Here you can see the assembly output for each: https://godbolt.org/z/WY3qqzrET
Compilers are doing good job.

Analysis

In the worst case every variant need to load all 8 words. Additionally, "sub" executes 4 subtract instructions (independent of input). The "ne" executes 4 cmp+jmp instead. These 2 instructions are fused to single micro-instruction. So it should perform the same as "sub" in worst case but may perform better if it can "exit" earlier.

Benchmarks

I added all variants of implementations in #277 and benchmarked them on two CPUs. There are 5 benchmark cases:

  • 64, 128, 192, 256 mean the value size in bits. E.g. 64 means we are comparing two 64 bit numbers (i.e. 3 top words are zeros).
  • 0 equal combination of above cases, shuffled.
    There are also some internal changes to benchmarks to improve stability of results.

Haswell

compare<[lt_split vs. lt_ne]>/0_mean                    +0.1471         +0.1471             1             2             1             2
compare<[lt_split vs. lt_ne]>/64_mean                   -0.0009         -0.0009             2             2             2             2
compare<[lt_split vs. lt_ne]>/128_mean                  -0.0009         -0.0009             2             2             2             2
compare<[lt_split vs. lt_ne]>/192_mean                  +0.0013         +0.0013             1             1             1             1
compare<[lt_split vs. lt_ne]>/256_mean                  +0.2908         +0.2908             1             2             1             2
compare<[lt_split vs. lt_sub]>/0_mean                    -0.0345         -0.0345             1             1             1             1
compare<[lt_split vs. lt_sub]>/64_mean                   -0.0827         -0.0827             2             1             2             1
compare<[lt_split vs. lt_sub]>/128_mean                  -0.0818         -0.0818             2             1             2             1
compare<[lt_split vs. lt_sub]>/192_mean                  +0.0992         +0.0992             1             1             1             1
compare<[lt_split vs. lt_sub]>/256_mean                  +0.0978         +0.0978             1             1             1             1
compare<[lt_sub vs. lt_ne]>/0_mean                    +0.1882         +0.1882             1             2             1             2
compare<[lt_sub vs. lt_ne]>/64_mean                   +0.0892         +0.0892             1             2             1             2
compare<[lt_sub vs. lt_ne]>/128_mean                  +0.0881         +0.0881             1             2             1             2
compare<[lt_sub vs. lt_ne]>/192_mean                  -0.0890         -0.0890             1             1             1             1
compare<[lt_sub vs. lt_ne]>/256_mean                  +0.1758         +0.1758             1             2             1             2

Skylake

compare<[lt_split vs. lt_ne]>/0_mean                    +0.0854         +0.0854             2             2             2             2
compare<[lt_split vs. lt_ne]>/64_mean                   +0.0036         +0.0036             2             2             2             2
compare<[lt_split vs. lt_ne]>/128_mean                  +0.1674         +0.1674             2             2             2             2
compare<[lt_split vs. lt_ne]>/192_mean                  -0.0026         -0.0027             2             2             2             2
compare<[lt_split vs. lt_ne]>/256_mean                  +0.0057         +0.0057             2             2             2             2
compare<[lt_split vs. lt_sub]>/0_mean                    -0.1562         -0.1562             2             2             2             2
compare<[lt_split vs. lt_sub]>/64_mean                   -0.0004         -0.0004             2             2             2             2
compare<[lt_split vs. lt_sub]>/128_mean                  +0.0004         +0.0004             2             2             2             2
compare<[lt_split vs. lt_sub]>/192_mean                  -0.2086         -0.2086             2             2             2             2
compare<[lt_split vs. lt_sub]>/256_mean                  -0.2122         -0.2122             2             2             2             2
compare<[lt_sub vs. lt_ne]>/0_mean                    +0.2863         +0.2863             2             2             2             2
compare<[lt_sub vs. lt_ne]>/64_mean                   +0.0039         +0.0039             2             2             2             2
compare<[lt_sub vs. lt_ne]>/128_mean                  +0.1670         +0.1670             2             2             2             2
compare<[lt_sub vs. lt_ne]>/192_mean                  +0.2602         +0.2602             2             2             2             2
compare<[lt_sub vs. lt_ne]>/256_mean                  +0.2766         +0.2766             2             2             2             2

Zen3

This is noisy machine with AMD CPU. For this case I used lt_ne2 for comparison because it was 15% better than the lt_ne.

compare<[lt_split vs. lt_ne2]>/0_mean                    -0.1398         -0.1397             4             3             4             3
compare<[lt_split vs. lt_ne2]>/64_mean                   -0.2668         -0.2668             3             2             3             2
compare<[lt_split vs. lt_ne2]>/128_mean                  -0.1391         -0.1392             3             3             3             3
compare<[lt_split vs. lt_ne2]>/192_mean                  -0.1564         -0.1564             3             3             3             3
compare<[lt_split vs. lt_ne2]>/256_mean                  -0.1487         -0.1487             3             3             3             3
compare<[lt_split vs. lt_sub]>/0_mean                    -0.2012         -0.2010             4             3             4             3
compare<[lt_split vs. lt_sub]>/64_mean                   -0.1363         -0.1366             3             3             3             3
compare<[lt_split vs. lt_sub]>/128_mean                  -0.1508         -0.1507             3             3             3             3
compare<[lt_split vs. lt_sub]>/192_mean                  -0.1555         -0.1555             3             3             3             3
compare<[lt_split vs. lt_sub]>/256_mean                  -0.1456         -0.1456             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/0_mean                    +0.0769         +0.0768             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/64_mean                   -0.1511         -0.1508             3             2             3             2
compare<[lt_sub vs. lt_ne2]>/128_mean                  +0.0137         +0.0136             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/192_mean                  -0.0010         -0.0010             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/256_mean                  -0.0036         -0.0036             3             3             3             3

Xeon Platinum 8272CL

compare<[lt_split vs. lt_ne]>/0_mean                    +0.2017         +0.2016             2             2             2             2
compare<[lt_split vs. lt_ne]>/64_mean                   -0.0078         -0.0073             2             2             2             2
compare<[lt_split vs. lt_ne]>/128_mean                  +0.3924         +0.3924             2             2             2             2
compare<[lt_split vs. lt_ne]>/192_mean                  -0.0017         -0.0016             2             2             2             2
compare<[lt_split vs. lt_ne]>/256_mean                  -0.0009         -0.0009             2             2             2             2
compare<[lt_split vs. lt_sub]>/0_mean                    -0.0803         -0.0803             2             2             2             2
compare<[lt_split vs. lt_sub]>/64_mean                   -0.0051         -0.0046             2             2             2             2
compare<[lt_split vs. lt_sub]>/128_mean                  -0.0021         -0.0020             2             2             2             2
compare<[lt_split vs. lt_sub]>/192_mean                  -0.0025         -0.0025             2             2             2             2
compare<[lt_split vs. lt_sub]>/256_mean                  -0.0023         -0.0023             2             2             2             2
compare<[lt_sub vs. lt_ne]>/0_mean                    +0.3065         +0.3065             2             2             2             2
compare<[lt_sub vs. lt_ne]>/64_mean                   -0.0027         -0.0027             2             2             2             2
compare<[lt_sub vs. lt_ne]>/128_mean                  +0.3953         +0.3953             2             2             2             2
compare<[lt_sub vs. lt_ne]>/192_mean                  +0.0009         +0.0009             2             2             2             2
compare<[lt_sub vs. lt_ne]>/256_mean                  +0.0015         +0.0015             2             2             2             2

From the results we should rather consider switching back to "sub" implementation. I may do some other benchmarks if I have time.

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 11 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants